The GEN-ERA toolbox: unified and reproducible workflows for research in microbial genomics

Abstract

Background
Microbial culture collections play a key role in taxonomy by studying the diversity of their strains and providing well-characterized biological material to the scientific community for fundamental and applied research. These microbial resource centers thus need to implement new standards in species delineation, including whole-genome sequencing and phylogenomics. In this context, the genomic needs of the Belgian Coordinated Collections of Microorganisms (BCCM) were studied, resulting in the GEN-ERA toolbox. The latter is a unified cluster of bioinformatic workflows dedicated to both bacteria and small eukaryotes (e.g., yeasts).

Findings
This public toolbox allows researchers without specific training in bioinformatics to perform robust phylogenomic analyses. Hence, it facilitates all steps from genome downloading and quality assessment, including genomic contamination estimation, to tree reconstruction. It also offers workflows for average nucleotide identity comparisons and metabolic modeling.

Technical details
Nextflow workflows are launched by a single command and are available on the GEN-ERA GitHub repository (https://github.com/Lcornet/GENERA). All the workflows are based on Singularity containers to increase reproducibility.

Testing
The toolbox was developed for a diversity of microorganisms, including bacteria and fungi. It was further tested on an empirical dataset of 18 (meta)genomes of early-branching Cyanobacteria, providing the most up-to-date phylogenomic analysis of the Gloeobacterales order, the first group to diverge in the evolutionary tree of Cyanobacteria.

Conclusion
The GEN-ERA toolbox can be used to infer completely reproducible comparative genomic and metabolic analyses on prokaryotes and small eukaryotes. Although designed for routine bioinformatics of culture collections, it can also be used by all researchers interested in microbial taxonomy, as exemplified by our case study on Gloeobacterales.

REVIEWER 1
Reviewer reports:
Reviewer #1: Paper Title: The GEN-ERA toolbox: unified and reproducible workflows for research in microbial genomics
The GEN-ERA toolbox provides a number of containerized workflows that allow researchers (without any specific training in bioinformatics) to study the diversity of well-characterized strains for fundamental and applied research. More specifically, it facilitates all steps from genome downloading and quality assessment, including genomic contamination estimation, to phylogenetic tree reconstruction. It additionally provides workflows for average nucleotide identity comparisons and metabolic modeling.
The supplementary file provides details of how to run the whole workflow (through 10 steps), found in the GEN-ERA toolbox on basal, for an empirical dataset of early emerging cyanobacteria. It provides an up-to-date phylogenomic analysis of the Gloeobacterales order, the first group to diverge in the evolutionary tree of Cyanobacteria. The GitHub repository at https://github.com/Lcornet/GENERA also provides more details about the GEN-ERA tool suite. Although the manuscript mentions that the call to Mantis could not be included in the Singularity call, the GitHub repository indicates that Mantis is now installed in a Singularity container for the Metabolic workflow (installation is no longer necessary).
>The problem with Mantis was access to the database outside the Singularity container: it was impossible to connect Mantis to the database from within a container. During the review process, we decided to install the database inside the container (available from the DOX page), so that users no longer need to install Mantis, which also resolves the connection issue. This has been changed in the manuscript.
Line 202 of the tracked change document.
The tool has been tested on an empirical dataset of 18 (meta)genomes of early-branching Cyanobacteria, and the time taken as well as the results of the run are documented in the supplementary file. The authors claim that the tool suite can be used to study the diversity of microorganisms, including bacteria and fungi. From the GitHub repository, it is clear that a number of publications in high-impact journals have already resulted from the development of GEN-ERA.
1) Are the methods appropriate to the aims of the study, are they well described, and are necessary controls included?
This study aims at describing a toolbox, named GEN-ERA, and the methods section defines the various steps of the tool suite. With the supplementary file and the GitHub repository at hand, the manuscript is easy to follow. The versions of the programs used in the case study are provided in the form of Nextflow scripts.
>Thank you.
2) Are the conclusions adequately supported by the data shown?
The results of running the tool suite on an empirical dataset of 18 (meta)genomes of early-branching Cyanobacteria at each step, together with the time taken to download the files and to run each step, are convincing evidence that it works well, at least for Cyanobacteria. However, this is found in the Supplementary Material. There should be a section on Discussion and Conclusion in the main text.
>Thank you for this suggestion. A paragraph has been added at the end of the case study to emphasize and summarize the usage of the GEN-ERA toolbox. The format of the paper, a technical note, does not allow a conclusion within the main text. Consequently, we chose to add this paragraph at the end of the Findings section.

Powered by Editorial Manager® and ProduXion Manager® from Aries Systems Corporation
Line 276-294 of the tracked change document.
3) Please indicate the quality of language in the manuscript. Does it require a heavy editing for language and clarity?
The use of English language is adequate and concise, and can be clearly understood by researchers interested in studying the diversity of microorganisms.
4) Are you able to assess all statistics in the manuscript, including the appropriateness of statistical tests used?
The statistics involved in the phylogenetic analyses are integrated in the existing programs. Hence I am not able to assess the statistics.
>Note that we did not design new statistical analyses; we report the analyses provided by the programs used by the toolbox.

5) Final Comments
The proposed toolbox/tool suite described in this manuscript is very relevant and worth a read for researchers interested in studying the diversity of microorganisms, including bacteria and fungi, especially as it makes their work easier through the use of well-defined containerized Nextflow workflows.
>Thank you.
I strongly believe that the main manuscript should include a section discussing the results of running the toolbox on the case study, as well as a Conclusion. This will help readers better understand the importance of the toolbox.
>A paragraph has been added at the end of the Findings section, see previous comment.
Line 276-294 of the tracked change document.

REVIEWER 2
Reviewer #2: Cornet et al. have generated a collection of Nextflow pipelines to analyse genome or raw sequencing data of microbial organisms and protists. The methodology appears sound and reproducible. My main concern with the manuscript is that it is not well described in the abstract, introduction or GitHub repository. It isn't clear whether the analyses are specific to genomics questions arising from culture collections, or whether they are more broadly applicable. There is also no discussion of other pipelines that achieve similar things, e.g. ATLAS (https://metagenome-atlas.github.io/).
>A paragraph has been added at the end of the case study to emphasize and summarize the usage of the GEN-ERA toolbox. A sentence has been added to emphasize that our toolbox is designed for comparative genomics of both bacteria and small eukaryotes, which was not the purpose of other pipelines. Although the toolbox was developed for culture collections, it can be used on any genomic data, as shown by our case study. A sentence on this subject has also been added to this paragraph.
Line 276-294 of the tracked change document.
I also had a number of minor concerns, detailed below.
A number of grammatical errors were detected; these should be fixed. Parts of the manuscript are also slightly too informal, e.g. "This confirms the interest of using ORPER to spot interesting SSU rRNA sequences"
>The sentence has been deleted.

It would be helpful if the GitHub front page could provide a concise description of what the software aims to achieve, to make its use more understandable.
>A description of the toolbox has been added to the GitHub front page.
"The GEN-ERA toolbox is a suite of Nextflow-Singularity workflows designed for comparative genomics of bacteria and small eukaryotes. Without any installation, it allows researchers to download, assemble and bin (meta)genomes (from short or long reads). Orthologous inference and maximum likelihood phylogenomic analyses (bootstrap and jackknife) can be inferred with this suite. Constrained (by a ribosomal phylogenomic) SSU rRNA phylogeny can also be inferred. Average nucleotide identity, GTDB identification and metabolic modelling are also included in the toolbox."
106: "as it happened" grammatical error
>The sentence has been modified.
"Assembly.nf" Assembly is commonly a separate process from binning, but here binning has been included. Perhaps a clearer name might be Genome-recovery.nf?
>The workflow was originally created for genome assembly only, and we later added more options, such as binning. It was more useful for users to have all these features in a single workflow. Nevertheless, we chose not to rename the tool because it is already widely used under its former name, Assembly.nf. It is used in several ongoing projects and papers, and it would be difficult to change now.
124: "Researchers interested in a better understanding of these tools can read the recent review on the detection of genomic contamination made by Cornet et al. [15]." While not inappropriate, this is perhaps too much self-citation.
>The sentence has been deleted.
Why is contamination assessed but not completeness?
>Completeness is also estimated; this has been added to the manuscript.
Line 131 and 143 of the tracked change document.
129: "annotation of bacterial proteins is automatic" Automatic in what sense? Annotation also usually refers to describing the function of the protein, but here the meaning appears to be restricted to ORF calling. I found this somewhat confusing. Also, "in the different GEN-ERA workflows" is unclear - does this mean that Prodigal is run as part of the Assembly.nf workflow, for instance?
>The annotation here means prediction of proteins. We have now specified this in the manuscript. We also added the names of the workflows where the bacterial protein prediction is included.
Line 146-148 of the tracked change document.
143: "Orthology.nf automatically provides the core genes, shared by all the organisms in unicopy" what is meant by "all organisms" here?
>It is user dependent. This can be all the organisms provided to Orthology.nf, or the user can choose to exclude the outgroup, for instance. We added new options for this to the workflow and explained them in the wiki. In the text, we replaced "all organisms" with "all genomes provided by the user".
Line 161 of the tracked change document.
145: "The OGs of proteins can be further enriched" what does "enriched" mean?
>By "enriched", we mean adding orthologous sequences to an OG without having to run a new orthologous inference. It is now specified in the manuscript.
Line 164 of the tracked change document.
163: GTDB.nf is described in the "Other workflows" section, when it is phylogeny-related.
>We only use GTDB in the toolbox to classify genomes, which is more taxonomy related. This is why it is treated in "Other workflows". It is now specified in the manuscript.
Line 195 of the tracked change document.
172: "it was technically not possible to include Mantis in a container" I am curious as to why this was the case? I do not have any specific insight or ability to judge the accuracy of this statement, just curious. Inclusion of a sentence describing the difficulties might help other workflow developers and/or the Mantis developers.
>The problem with Mantis was access to the database outside the Singularity container: it was impossible to connect Mantis to the database from within a container. During the review process, we decided to install the database inside the container (available from the DOX page), so that users no longer need to install Mantis, which also resolves the connection issue. This has been changed in the manuscript.
Line 202 of the tracked change document.
190: "Gloeobacterales are the most basal order of the Cyanobacteria phylum" This statement is somewhat controversial, because the GTDB has defined the Melainobacteria as being a part of the Cyanobacteria phylum based on RED values. I would suggest removing "the most basal" or making it clear that cyanobacteria refers to photosynthetic cyanobacteria rather than the phylum.
>Indeed, this can be controversial. We now specify photosynthetic cyanobacteria.
Line 225 of the tracked change document.
189: The methods for this section are not described in the methods section. They are only briefly described in the Findings section. A clearer link to these methods should be made between the main text and the methods.
>A new section has been added to the methods to describe the case study.
>It is the presence and localization of genomes among the SSU rRNA diversity. We added this definition to the manuscript.
Line 251 of the tracked change document.
224: "Our results demonstrate the absence of one metabolic pathway" There are many metabolic pathways; presumably it is missing more than one.

The second tool, Assembly.nf, is dedicated to genome production. This workflow can assemble genomes and metagenomes, not only from Illumina short reads but also PacBio or Nanopore long reads data, thanks to the use of SPAdes

The GEN-ERA toolbox was initially tested by the users from the BCCM involved in the GEN-ERA project, who were thus considered as beta testers, on a SLURM-operated HPC system (durandal2/nic5, CÉCI-ULiège). These users were not advanced bioinformatics researchers and the user guide was developed based on their needs to ensure an easy-to-use toolbox. This toolbox was further tested on the Gloeobacterales order (Cyanobacteria) as a case study. All command lines used for this test case are provided in Supplemental Note 1.

Gloeobacterales as a case study showed that the toolbox can be used for any comparative genomics of microorganisms, using genomic or metagenomic (public) sequencing data. Indeed, it allowed to re-assemble metagenomes, and to make the binning (the latter was deleted from NCBI servers). Using the toolbox, public genomes were also downloaded and their quality estimated, notably the genomic contamination. The inference of core genes from these genomes was performed

Orthology.nf can compute (optional) core genes. Core genes are considered here as unicopy genes shared by all organisms (and only these organisms) of a user-specified list, without exception. Another option allows the user to determine the specific genes, considered here as genes specific to a sub-list of organisms, without intruders. The main difference with core genes is that specific candidate OGs will undergo an orthologous enrichment by mining the genomes of all the organisms of the orthologous inference. This strategy is used in our analyses of the Snodgrassella-specific gene content [85]
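
The core-gene and specific-gene definitions in the Orthology.nf excerpt can be illustrated with a minimal Python sketch. This is not the Orthology.nf implementation: the function names, the toy orthologous groups (OGs) and the organism labels are hypothetical, and reading "without intruders" as "no organism outside the sub-list" is an assumption made for illustration only.

```python
from collections import Counter

def is_core(og, organisms):
    """Return True if the OG is a core gene for the given organism list.

    og: list of (organism, sequence_id) pairs (hypothetical toy format).
    Core = every listed organism present exactly once (unicopy),
    and no organism outside the list.
    """
    counts = Counter(org for org, _ in og)
    return set(counts) == set(organisms) and all(n == 1 for n in counts.values())

def is_specific(og, sublist):
    """Return True if the OG contains only organisms from the sub-list.

    Assumption: "without intruders" is read as "no organism outside the
    sub-list"; copy number is not constrained in this sketch.
    """
    counts = Counter(org for org, _ in og)
    return bool(counts) and set(counts) <= set(sublist)

# Toy OGs over four hypothetical organisms A-D.
ogs = {
    "OG1": [("A", "a1"), ("B", "b1"), ("C", "c1")],               # one copy each
    "OG2": [("A", "a2"), ("A", "a3"), ("B", "b2"), ("C", "c2")],  # A is duplicated
    "OG3": [("A", "a4"), ("B", "b3")],                            # C is missing
    "OG4": [("A", "a5"), ("B", "b4"), ("C", "c3"), ("D", "d1")],  # D is an intruder
}

core = sorted(name for name, og in ogs.items() if is_core(og, {"A", "B", "C"}))
specific = sorted(name for name, og in ogs.items() if is_specific(og, {"A", "B"}))
print(core)
print(specific)
```

In this toy input, only OG1 satisfies the core definition for the list A, B, C, and only OG3 is restricted to the sub-list A, B. The orthologous enrichment step that the excerpt describes for specific candidate OGs is not modeled here.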